Automatic Comma Insertion of Lecture Transcripts Based on Multiple Annotations

نویسندگان

  • Yuya Akita
  • Tatsuya Kawahara
چکیده

To enhance readability and usability of speech recognition results, automatic punctuation is an essential process. In this paper, we address automatic comma prediction based on conditional random fields (CRF) using lexical, syntactic and pause information. Since there is large disagreement in comma insertion between humans, we model individual tendencies of punctuation using annotations given by multiple annotators, and combine these models by voting and interpolation frameworks. Experimental evaluations on real lecture speech demonstrated that the combination of individual punctuation models achieves higher prediction accuracy for commas agreed by all annotators and those given by individual annotators.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Automatic Comma Insertion for Japanese Text Generation

This paper proposes a method for automatically inserting commas into Japanese texts. In Japanese sentences, commas play an important role in explicitly separating the constituents, such as words and phrases, of a sentence. The method can be used as an elemental technology for natural language generation such as speech recognition and machine translation, or in writing-support tools for non-nati...

متن کامل

Extension of the LECTRA corpus: classroom LECture TRAnscriptions in European Portuguese

This paper presents the recent extension of the LECTRA corpus, a speech corpus of university lectures in European Portuguese that will be partially available. Eleven additional hours of various lectures were transcribed, following the previous multilayer annotations, and now comprising about 32 hours. This material can be used not only for the production of multimedia lecture contents for e-lea...

متن کامل

Detecting Commas in Slovak Legal Texts

This paper reports on initial experiments with automatic comma recovery in legal texts. In deciding whether to insert a comma or not, we propose to use the value of the probability of a bigram of two words without a comma and a trigram of the words with the comma. The probability is determined by the language model trained on sentences with commas labeled as separate words. In the training data...

متن کامل

The Effects of Multimedia Annotations on Iranian EFL Learners’ L2 Vocabulary Learning

In our modern technological world, Computer-Assisted Language learning (CALL) is a new realm towards learning a language in general, and learning L2 vocabulary in particular. It is assumed that the use of multimedia annotations promotes language learners’ vocabulary acquisition. Therefore, this study set out to investigate the effects of different multimedia annotations (still picture annotatio...

متن کامل

Automatic identification and storage of significant points in a computer-based presentation

We describe an automatic classroom capture system that detects and records significant (stable) points in lectures by sampling and analyzing a sequence of screen capture frames from a PC used for presentations, application demonstrations, etc. The system uses visual inspection techniques to scan the screen capture stream to identify points to store. Unlike systems that only detect and store sli...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011